ABSTRACT


A random sample is taken from a population consisting of an unknown
number of distinct species. A quantity of interest is the probability of
discovering a new species when an additional draw from the population is
made. An estimator of this quantity was introduced by Starr (1979). We prove
a conjecture of Starr's that the estimator is uniformly minimum variance
unbiased and give various asymptotic properties of the estimator. A
nonparametric maximum likelihood estimator is introduced which has similar
asymptotic properties. A Monte-Carlo study is given which suggests guidelines
for choosing an estimator under various circumstances.


AMS  (MOS)  Subject  Classifications:  Primary  92A10;  Secondary  62G05,  62P10 

Keywords  and  Phrases:  U-statistics,  nonparametric  maximum  likelihood 

estimator 

Work  Unit  Number  4  (Statistics  and  Probability) 


*Department of Statistics, 1210 West Dayton Street, University of Wisconsin-Madison,
Madison, WI 53706.


Sponsored by the United States Army under Contract No. DAAG29-80-C-0041.


NONPARAMETRIC ESTIMATION OF THE PROBABILITY OF
DISCOVERING A NEW SPECIES

Murray K. Clayton* and Edward W. Frees*


§1. Introduction

In many ecological studies, a population is sampled to determine the
number of species that exist in the population. This quantity provides a
partial description of the population and may be used in the comparison of
populations over time or space. Such sampling often takes place sequentially
and it is in this context that a related quantity arises: the probability of
discovering a new species in a future sample based on sampling which has
already taken place. By itself, this probability indirectly leads to information
about the number of species in the population; it might also be used in a
sequential sampling scheme where the goal is to decide when to stop sampling.

To describe the problem formally, consider a population composed of distinct
species and use $M_i$ to represent the $i$th species, $i = 1, 2, \ldots$. We assume
that the species have no natural order and that the number of species may be
countably infinite. Suppose $n$ independent drawings are made from the population,
with replacement if the population is finite, and define $X_j = i$ when the
$j$th draw is from $M_i$. Let $x_i^n = \sum_{j=1}^{n} I(X_j = i)$ be the number of
representatives of the $i$th species in $n$ drawings, where $I(A)$ is the indicator
function of the set $A$. The conditional probability of discovering a new species
in one additional search is

$$U_n = \sum_i p_i \, I(x_i^n = 0), \qquad (1.1)$$

where $p_i = P(X_j = i)$. The corresponding unconditional probability of new
species discovery is

$$\theta_n = E\,U_n = \sum_i p_i q_i^n, \qquad (1.2)$$

where $q_i = 1 - p_i$.
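As a small numerical illustration of (1.2), the following sketch evaluates $\theta_n$ when the species probabilities are known. The function name `discovery_probability` is hypothetical and chosen only for this example; in practice the $p_i$ are unknown, which is why estimators are needed.

```python
def discovery_probability(p, n):
    """theta_n = sum_i p_i * (1 - p_i)**n for a known species distribution p.

    Hypothetical helper for illustration; p is a list of species probabilities.
    """
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("species probabilities must sum to one")
    return sum(pi * (1.0 - pi) ** n for pi in p)


# Ten equiprobable species and n = 10 draws: theta_n reduces to (1 - 0.1)**10.
theta = discovery_probability([0.1] * 10, 10)
```

In the equiprobable case every term is identical, so the sum collapses to $(1-\mu)^n$ with $\mu = 0.1$.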

As argued, for example, by Starr (1979), standard statistical procedures
for direct estimation of a realization of the random variable $U_n$ are
inadequate. An alternative, and closely related, goal is the estimation of the
parameter $\theta_n$. Estimation of $\theta_n$ has attracted interest in the recent
literature; for example, Starr (1979), Chao (1981, correction, 1982), and Banerjee
and Sinha (1985) have introduced estimators of $\theta_n$. For earlier
efforts on this and related problems, see Good (1953, 1965), Good and Toulmin
(1956), Goodman (1949), Harris (1959, 1968), Knott (1967) and Robbins (1968).
We note that our model is not confined to sampling species from populations;
related problems are discussed in Efron and Thisted (1976) and others. The
sequential problem mentioned above is discussed in Goodman (1953), Rasmussen
and Starr (1979), and Banerjee and Sinha (1985). A Bayesian approach can be
found in Hill (1979).

Without additional constraints on the model, it is well known that there
is no unbiased estimator of $\theta_n$ based on a sample size less than $n+1$ (cf.
Appendix A, Lemma A.1). However, if one additional search is made, Robbins
(1968) noted that

$$V_1 = (n+1)^{-1} \sum_i I(x_i^{n+1} = 1) \qquad (1.3)$$

is an unbiased estimator of $\theta_n$. Robbins also argued that $V_1$ follows $U_n$ in the
sense that the expected squared difference is strictly bounded from above by
$(n+1)^{-1}$. Starr (1979) gave a more general version of the Robbins estimator.
Starr supposed that the initial search of size $n$ was extended by $m$ additional
stages and defined

$$V_m = \sum_{k=1}^{m} \frac{\binom{m-1}{k-1}}{\binom{n+m}{k}} \sum_i I(x_i^{n+m} = k). \qquad (1.4)$$

The term $\sum_i I(x_i^{n+m} = k)$ is the number of species with $k$ representatives and is
a part of the so-called "sampling frequency of frequencies" (cf. Good,
1953). It is important in applications because only the summary statistics
$\{\sum_i I(x_i^{n+m} = k)\}$ need to be retained for analysis. Starr showed that $V_m$ is
the unique unbiased estimator which is a linear combination of
$\{\sum_i I(x_i^{n+m} = k)\}$ and conjectured that it is the minimum variance unbiased
estimator (MVUE). This property was discussed by Chao (1981) who proposed an
alternative estimator which was further modified by Banerjee and Sinha (1985).
Chao's estimator was motivated by Harris's (1968) work in the important
special case of equal cell probabilities.
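The point that only the frequency of frequencies needs to be retained can be sketched in code. The function below is a minimal illustration, not Starr's own implementation: the name `starr_estimator` is hypothetical, and the weights $\binom{m-1}{k-1}/\binom{n+m}{k}$ are an assumption chosen so that the $m = 1$ case reduces to (1.3) and the result agrees with the U-statistic representation used in §2.

```python
from collections import Counter
from math import comb


def starr_estimator(sample, n):
    """Linear combination of the frequency of frequencies (assumed weights).

    sample: all n + m observed species labels; the first n are the initial search.
    """
    m = len(sample) - n
    if m < 1:
        raise ValueError("need at least one additional draw (m >= 1)")
    total = n + m
    species_counts = Counter(sample)                 # x_i^{n+m} for each observed species
    freq_of_freq = Counter(species_counts.values())  # number of species with k representatives
    return sum(
        comb(m - 1, k - 1) / comb(total, k) * freq_of_freq[k]
        for k in range(1, m + 1)
    )


# n = 3, m = 1: two of the four draws are singletons, so the value is 2 / (n + 1) = 0.5.
v1 = starr_estimator(["a", "b", "a", "c"], 3)
```

Only the counts of species seen once, twice, and so on enter the computation; the species labels themselves are discarded after counting.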

In §2 we answer the issues raised by Chao (1981, 1982) and Banerjee and
Sinha (1985) by proving Starr's conjecture that $V_m$ is the MVUE. The technique
is to use some results of Halmos (1946) on unbiased estimation and show that
$V_m$ is a U-statistic. Several other properties of $V_m$ are also immediately
available based on the theory of U-statistics and are described in §2. In §3
we introduce a nonparametric maximum likelihood estimator (NPMLE) as an
alternative to $V_m$. Although the NPMLE is biased in finite samples, we show that it
has similar large-sample properties. Some heuristic arguments in addition to
the simulation results of §4 suggest that the NPMLE may be a desirable
alternative to Starr's estimator in certain situations. We close in §5 with some
general remarks. Appendices A and B provide details of the proofs of the
technical results of §2 and §3, respectively.
For convenience, we begin by stating some results of Halmos (1946). A
direct consequence of these results is the verification of Starr's conjecture.
Another consequence is that $V_m$, defined in (1.4), is a U-statistic. This
property has further consequences which we exploit.

To state Halmos's results, define $\Pi^*$ to be the class of all probability
distributions on $R$, the real line. Let $E$ be a Borel subset of $R$. Define $\Pi(E)$
to be the class of all $P \in \Pi^*$ that assign probability one to some finite subset of
$E$ and let $\Pi$ be some subset of $\Pi^*$ that contains $\Pi(E)$. For each $P \in \Pi$, let
$X_1, \ldots, X_n$ be an i.i.d. random sample. Let $\{i_1, \ldots, i_k\}$ be a subset of size $k$
of $\{1, 2, \ldots, n\}$ and let $\sum_c$ be the sum over all $\binom{n}{k}$ distinct combinations of
$\{i_1, \ldots, i_k\}$. A linear functional $F(P)$ is said to be homogeneous of degree $k$
if there exists a mapping $h$ from $R^k$ to $R$ such that

$$F(P) = E_P\, h(X_1, \ldots, X_k) = \int \cdots \int h(x_1, \ldots, x_k) \, dP(x_1) \cdots dP(x_k)$$

for all $P \in \Pi$ and if the integer $k$ is minimal.

Lemma 2.1 (Halmos, 1946, Theorems 3 and 5)

Let $F(P)$ be homogeneous of degree $k$ over $\Pi$ with $F(P) = E_P\, h(X_1, \ldots, X_k)$.

(a) If $f(X_1, \ldots, X_n)$ is a symmetric, unbiased estimate of $F(P)$, then for
every point $(x_1, \ldots, x_n)$ with $x_i \in E$, $f(x_1, \ldots, x_n) = \binom{n}{k}^{-1} \sum_c h(x_{i_1}, \ldots, x_{i_k})$.

(b) Among all unbiased estimators of $F(P)$, $\binom{n}{k}^{-1} \sum_c h(X_{i_1}, \ldots, X_{i_k})$ has
minimum variance.

To prove Starr's conjecture, define $E = \{1, 2, \ldots\}$, $N = n+m$ and let $\Pi$ be
the set of all probability distributions defined on $E$. We shall find the form
of $h(\cdot)$ which is appropriate for this application. To motivate the discussion,
we note that the indicator of the $i$th species having one representative
can be expressed by

$$I(x_i^{n+1} = 1) = \sum_{j=1}^{n+1} I(X_j = i) \prod_{k=1,\, k \ne j}^{n+1} I(X_k \ne i). \qquad (2.1)$$

We use the kernel function of size $n+1$ defined by

$$h(X_1, \ldots, X_{n+1}) = (n+1)^{-1} \sum_i \sum_{j=1}^{n+1} I(X_j = i) \prod_{k=1,\, k \ne j}^{n+1} I(X_k \ne i), \qquad (2.2)$$

that is, the proportion of species with one representative. It is easy to see
that $h(\cdot)$ is symmetric and unbiased for $\theta_n$. The proof that $\theta_n$ is homogeneous
of degree $k = n+1$ over $\Pi$ is standard and is given in Appendix A (Lemma A.1).
Thus, by Lemma 2.1 we immediately have the following properties.


Property 2.1

The statistic $V_m$ is a U-statistic with kernel $h(\cdot)$ and degree $n+1$, i.e.,

$$V_m = \binom{n+m}{n+1}^{-1} \sum_c h(X_{i_1}, \ldots, X_{i_{n+1}}). \qquad (2.3)$$
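The U-statistic representation can be checked by brute force on a small sample. The sketch below averages the kernel (the number of singleton species in a subset, divided by the subset size) over all subsets of size $n+1$ and compares the result with a frequency-of-frequencies form. The function names are hypothetical, and the weights $\binom{m-1}{k-1}/\binom{n+m}{k}$ in `frequency_form` are an assumption obtained by counting, for a species with $k$ representatives, the subsets that contain exactly one of them.

```python
from collections import Counter
from itertools import combinations
from math import comb


def kernel(xs):
    """h: number of species appearing exactly once in xs, divided by len(xs)."""
    counts = Counter(xs)
    return sum(1 for c in counts.values() if c == 1) / len(xs)


def u_statistic(sample, n):
    """Average of the kernel over all subsets of size n + 1 (brute force)."""
    subsets = list(combinations(sample, n + 1))
    return sum(kernel(s) for s in subsets) / len(subsets)


def frequency_form(sample, n):
    """Same quantity from the frequency of frequencies (assumed weights)."""
    total = len(sample)
    m = total - n
    freq_of_freq = Counter(Counter(sample).values())
    return sum(
        comb(m - 1, k - 1) / comb(total, k) * freq_of_freq[k]
        for k in range(1, m + 1)
    )


sample = [1, 2, 1, 3, 2, 4]          # n = 3 initial draws, m = 3 additional draws
u_value = u_statistic(sample, n=3)
f_value = frequency_form(sample, n=3)
```

The brute-force average visits $\binom{n+m}{n+1}$ subsets and is only practical for very small samples, whereas the frequency form needs a single pass over the data.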


Property 2.2

Based on a random sample of size $n+m$, $V_m$ is the MVUE for $\theta_n$ over $\Pi$.


A consequence of Property 2.2 is that $V_m$ has desirable properties as an
estimator of $\theta_n$ for any fixed number $m$ of additional searches. If the number of
additional searches is large, from Property 2.1 and the theory of U-statistics
it immediately follows that $V_m \to \theta_n$ with probability one, as $m \to \infty$. Thus, the
estimator converges to the parameter of interest. The rate of convergence can
further be described by

Property 2.3

Define $\sigma^2 = (n+1)^2 \{ \sum_i p_i q_i^{2n-2} (p_i - (n+1)^{-1})^2 - ( \sum_i p_i q_i^{n-1} (p_i - (n+1)^{-1}) )^2 \}$. Then,

$$V_m = \theta_n + (n+m)^{-1/2} \sigma Z + o_p((n+m)^{-1/2}) \qquad (2.4)$$

as $m \to \infty$, where $Z$ is a standard normal random variable.
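A short numerical sketch may help in reading the asymptotic variance. It assumes, following Appendix A, that $\sigma^2 = (n+1)^2 \mathrm{Var}(h_{1n}(X_1))$, which can be rewritten as the variance of $(n p_X - q_X) q_X^{n-1}$ when $X$ is a single draw from the population. The function name `asymptotic_variance` is hypothetical.

```python
def asymptotic_variance(p, n):
    """sigma^2 written as Var of (n*p_X - q_X) * q_X**(n-1) for one draw X.

    Assumes sigma^2 = (n+1)^2 * Var(h_1n(X_1)), as stated in Appendix A.
    """
    a = [(n * pi - (1.0 - pi)) * (1.0 - pi) ** (n - 1) for pi in p]
    mean = sum(pi * ai for pi, ai in zip(p, a))
    return sum(pi * ai * ai for pi, ai in zip(p, a)) - mean ** 2


sigma2_unequal = asymptotic_variance([0.5, 0.3, 0.2], 5)   # strictly positive
sigma2_equal = asymptotic_variance([0.25] * 4, 5)          # zero up to rounding
```

When all species are equally likely the quantity $(n p_X - q_X) q_X^{n-1}$ does not depend on $X$, so the variance vanishes and the normal limit in (2.4) degenerates.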

Remark: The proof of Property 2.3 is standard in the theory of U-statistics
(cf. Serfling, 1980, page 192). One only needs to check the calculation of
the asymptotic variance which is provided in Appendix A (Lemma A.2). Perhaps
the most interesting aspect of Property 2.3 is the fact that in the case of
equal species probabilities, it can easily be shown that $\sigma = 0$. Indeed, by
another application of U-statistic theory in Appendix A, we have

Property 2.4

Suppose $p_1 = p_2 = \cdots = p_{1/\mu} = \mu$ for some $\mu > 0$. Then,

$$V_m = \theta_n + (n+m)^{-1} \binom{n+1}{2} \mu (1-\mu)^{n-2} (\mu - 2(n+1)^{-1}) (\chi_1^2 - 1) + o_p((n+m)^{-1})$$

as $m \to \infty$, where $\chi_1^2$ is a chi-square random variable with 1 degree of freedom.

Thus, the rate at which $V_m$ approaches $\theta_n$ in the important special case of
equal probabilities is of a different order of magnitude (with respect to weak
convergence to a nondegenerate distribution) than the general case. This
characteristic is important since a comparison of various alternative
estimators in this special case can be misleading when drawing conclusions about
their relative performance in the more general set-up of unequal probabilities.
In other situations, Starr (1979), Chao (1981), and Banerjee and Sinha
(1985) use the equiprobable case as examples of their results. It should also
be noted that the equiprobable cells model is unlikely to arise in nature when
sampling for species although it arises naturally in the cataloging problem
of, for example, Harris (1959).


§3. An Alternative Nonparametric Estimator

Starr's estimator $V_m$ is attractive computationally, since it is a
linear combination of the "frequency of frequencies," and it has desirable
theoretical properties since it can be described as a U-statistic. However,
because it is derived from summary statistics, there may be some loss of
information in a finite number of additional searches, in some sense. For
example, if we set $m = 1$, then from (1.3) we see that $V_1$ is the sample
proportion of species with one representative. Note that this estimator treats
species with $0, 2, 3, \ldots, n+1$ representatives equally. Motivated by these
heuristic arguments, we introduce the following nonparametric estimator of $\theta_n$
based on an initial sample size $n$ and additional search $m$. Define
$\hat{p}_i = (n+m)^{-1} \sum_{j=1}^{n+m} I(X_j = i)$ and $\hat{q}_i = 1 - \hat{p}_i$, $i = 1, 2, \ldots$. The NPMLE of $\theta_n$ is
defined to be:

$$\hat{\theta}_m = \sum_i \hat{p}_i \hat{q}_i^{\,n}. \qquad (3.1)$$
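The plug-in nature of (3.1) is easy to see in code. The following is a minimal sketch with a hypothetical function name `npmle`; species that were never observed have $\hat{p}_i = 0$ and contribute nothing to the sum, so only observed species need to be visited.

```python
from collections import Counter


def npmle(sample, n):
    """Plug-in estimate sum_i p_hat_i * (1 - p_hat_i)**n from all n + m draws."""
    total = len(sample)
    return sum(
        (count / total) * (1.0 - count / total) ** n
        for count in Counter(sample).values()
    )


# n = 3, m = 1: p_hat = (1/2, 1/4, 1/4) for species a, b, c.
theta_hat = npmle(["a", "b", "a", "c"], 3)
```

Unlike the singleton-based estimator with $m = 1$, every observed species contributes here, with a weight that depends on its observed relative frequency.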


Unlike $V_m$, $\hat{\theta}_m$ is a biased estimator of $\theta_n$. Since $(n+m)\hat{q}_i$ is a binomial random
variable, it is straightforward to explicitly write out the bias as a linear
combination of powers of $q_i$ and Stirling numbers of the second kind. Finite
sample properties of $\hat{\theta}_m$ are further discussed in §4. Asymptotically (as
$m \to \infty$), $\hat{\theta}_m$ behaves similarly to $V_m$. By the strong law of large numbers, with
probability one, $\hat{q}_i \to q_i$, and it is not hard to show that $\hat{\theta}_m \to \theta_n$ with
probability one as $m \to \infty$.

We also have the following two asymptotic properties.

Property 3.1

Let $\sigma$ be as defined in Property 2.3. Then

$$\hat{\theta}_m = \theta_n + (n+m)^{-1/2} \sigma Z + o_p((n+m)^{-1/2})$$

as $m \to \infty$.


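Property 3.2

Suppose $p_1 = p_2 = \cdots = p_{1/\mu} = \mu$ for some $\mu > 0$. Then,

$$\hat{\theta}_m = \theta_n + (n+m)^{-1} \binom{n+1}{2} (1-\mu)^{n-2} (\mu - 2(n+1)^{-1}) (\mu(\chi_1^2 - 1) + (1-\mu)) + o_p((n+m)^{-1})$$

as $m \to \infty$.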

The proofs of Properties 3.1 and 3.2 are in Appendix B. Comparing
Properties 2.3 and 3.1, we see that $V_m$ and $\hat{\theta}_m$ are asymptotically equivalent to the
first order (i.e., $(n+m)^{-1/2}$). An advantage of the NPMLE $\hat{\theta}_m$ is that, since
strongly consistent estimators of $p_i$ and hence $\sigma$ can be constructed, we have
as an immediate corollary of Property 3.1 large sample interval estimates
of $\theta_n$. Comparing Properties 2.4 and 3.2, we see that $V_m$ and $\hat{\theta}_m$ are of the
same order of magnitude and have the same variance in their respective asymptotic
distributions. The estimator $V_m$ is slightly superior to $\hat{\theta}_m$ in the sense that
the asymptotic distribution of $V_m - \theta_n$ has mean zero unlike that of $\hat{\theta}_m - \theta_n$. We
remark that in this special case of equiprobable cells, Chao's (1981)
extension of Harris's (1968) estimator is MVUE for fixed $m$ and hence is a
strong competitor to $V_m$ and $\hat{\theta}_m$.

As noted, the rate of convergence of $V_m$ and $\hat{\theta}_m$ is markedly different in
the equiprobable case in comparison to the general case. Moreover, in some
sense the equiprobable case is the only one in which this can happen.
Specifically, we have the following result.

Property 3.3

Consider $\sigma^2$ defined in Property 2.3 and suppose that the number of
species exceeds $n$. Then, $\sigma^2 = 0$ if and only if $p_1 = p_2 = \cdots = p_{1/\mu} = \mu$ for
some $\mu > 0$.


§4. Small Sample Properties

In this section we investigate the behavior of Starr's estimator, $V_m$, and
the NPMLE, $\hat{\theta}_m$, when $m$ is small via a Monte-Carlo simulation. We look at their
bias and mean square error as estimates of $\theta_n$ and make some comments regarding
modifications of $\hat{\theta}_m$ which have desirable properties. Finally, we investigate
modifications of $V_m$ and $\hat{\theta}_m$ suitable for use when $m = 0$. All computations were
done on a VAX 11/750 owned and operated by the Department of Statistics at the
University of Wisconsin-Madison. The simulations were performed using the
National Bureau of Standards' Core Math Library (CMLIB) pseudo-uniform random
number generator UNI.

Two classes of distributions were used to construct the probability
distribution $\{p_i;\ i \ge 1\}$. These were: (1) equiprobable, with $p_i = \mu$, $1 \le i \le 1/\mu$;
and (2) truncated geometric, with $p_i = q p^{i-1}/(1 - p^c)$, $1 \le i \le c$, $0 < p < 1$,
$q = 1 - p$. For the equiprobable cells model, values of $\mu = .1, .02, .01$ were
used; for the truncated geometric model, values of $p = .1, .5, .9$ and $c = 10$,
$100$ were used. For each assignment of $\{p_i\}$, $\theta_n$ was determined and 1,000
simulations were performed. For each simulation, this involved drawing a sample
of size $n$ and a subsequent sample of size $m$. The pairs $(n,m) = (10,1)$,
$(10,10)$, $(50,1)$, $(50,10)$, $(50,50)$ were included. For each sample, $\hat{\theta}_m$ and $V_m$
were computed. Tables 4.1-4.2 show the mean values of $\hat{\theta}_m$ and $V_m$ over the
1,000 samples denoted in the tables as $E\hat{\theta}_m$ and $EV_m$, respectively. (The rows
corresponding to $m = 0$ will be discussed below.) In addition, the estimated
root mean square error of the estimates, denoted by RMSE($\hat{\theta}_m$) and RMSE($V_m$),
respectively, are given in Tables 4.1-4.2. Of course, since $V_m$ is unbiased,
RMSE($V_m$) is also an estimate of the standard error of $V_m$.
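The simulation design described above can be sketched as follows. This is an illustrative reimplementation, not the original CMLIB/UNI code: it uses Python's standard `random` module with an arbitrary seed, all function names are hypothetical, and the weights in `starr` are the assumed frequency-of-frequencies form that reduces to the singleton proportion when $m = 1$. Results will therefore differ in detail from the tabulated values.

```python
import random
from collections import Counter
from math import comb, sqrt


def truncated_geometric(p, c):
    """p_i = q * p**(i-1) / (1 - p**c) for i = 1..c, with q = 1 - p."""
    q = 1.0 - p
    return [q * p ** (i - 1) / (1.0 - p ** c) for i in range(1, c + 1)]


def starr(sample, n):
    """Unbiased estimator from the frequency of frequencies (assumed weights)."""
    total = len(sample)
    m = total - n
    freq_of_freq = Counter(Counter(sample).values())
    return sum(comb(m - 1, k - 1) / comb(total, k) * freq_of_freq[k]
               for k in range(1, m + 1))


def npmle(sample, n):
    """Plug-in estimator sum_i p_hat_i * (1 - p_hat_i)**n."""
    total = len(sample)
    return sum((c / total) * (1.0 - c / total) ** n
               for c in Counter(sample).values())


def simulate(probs, n, m, reps=1000, seed=0):
    """Return theta_n and (mean, RMSE) pairs for both estimators."""
    rng = random.Random(seed)
    theta = sum(pi * (1.0 - pi) ** n for pi in probs)
    species = list(range(len(probs)))
    v_vals, t_vals = [], []
    for _ in range(reps):
        sample = rng.choices(species, weights=probs, k=n + m)
        v_vals.append(starr(sample, n))
        t_vals.append(npmle(sample, n))

    def summary(vals):
        mean = sum(vals) / reps
        rmse = sqrt(sum((v - theta) ** 2 for v in vals) / reps)
        return mean, rmse

    return theta, summary(v_vals), summary(t_vals)


theta, (mean_v, rmse_v), (mean_t, rmse_t) = simulate(
    truncated_geometric(0.5, 10), n=10, m=10)
```

With 1,000 replications the Monte-Carlo standard error of each mean is roughly the reported RMSE divided by $\sqrt{1000}$, which gives a sense of how many digits of such a table are meaningful.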

Generally, in the equiprobable case, $V_m$ has lower root mean square error
than $\hat{\theta}_m$. Comparing Properties 2.4 and 3.2, we have, to leading order in $(n+m)^{-1}$,
$E(V_m - \theta_n)^2 = E(\hat{\theta}_m - \theta_n)^2 \cdot 2\mu^2/(3\mu^2 - 2\mu + 1)$. Thus, for $\mu$ small, RMSE($V_m$) will be
approximately $2^{1/2}\mu$ times RMSE($\hat{\theta}_m$). While the differences in RMSE for $V_m$
and $\hat{\theta}_m$ in Table 4.1 are not all of this magnitude, we do see that $V_m$ is a
better estimator of $\theta_n$ in terms of RMSE.

The situation is reversed to a large extent when the truncated geometric
is used for $p_i$. These results appear in Table 4.2. It is evident in this
case, as in Table 4.1, that $\hat{\theta}_m$ tends to underestimate $\theta_n$ and that the bias can
be considerable. From the results of sections 2 and 3, we expect
$\hat{\theta}_m$ and $V_m$ to have the same asymptotic mean square error. From Table 4.2, it
appears that, when $p$ is not too large, the mean square error of $\hat{\theta}_m$ is less
than that of $V_m$, sometimes considerably so. That this can fail when $p$ is large is not
surprising since the truncated geometric distribution tends to the
equiprobable case when $p$ tends to one. Specifically, $q p^{i-1}/(1 - p^c) \to 1/c$ for
each $i$ as $p \to 1$.

That $\hat{\theta}_m$ dominates $V_m$ for the truncated geometric distribution when $p$ is small
can be seen in an example: Let $c = 2$, $m = 1$, and $n = 2$, so $p_1 = q/(1 - p^2)$ and
$p_2 = qp/(1 - p^2)$. Then $\theta_2 = p_1 p_2$ and it is easy to show that $E\hat{\theta}_1 = \tfrac{2}{3} p_1 p_2$,
which represents a considerable bias. For this example it can be shown that
$E(\hat{\theta}_1 - \theta_2)^2 = (4 p_1 p_2 - 9 p_1^2 p_2^2)/27$, with a similar closed form for $E(V_1 - \theta_2)^2$. It follows that
$E(V_1 - \theta_2)^2 > E(\hat{\theta}_1 - \theta_2)^2$ if $p_2 < \tfrac{1}{2} - \sqrt{77}/66$, or equivalently, if $p < .5799$.
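The bias of the NPMLE in this two-species example can be confirmed by enumerating all $2^3$ possible samples. The sketch below is a hypothetical helper restricted to the plug-in estimator; it computes the exact mean and mean square error of $\hat{\theta}_1$ for a given $p_1$ and makes no claim about the other estimator.

```python
from collections import Counter
from itertools import product


def exact_npmle_moments(p1, n=2, m=1):
    """Exact mean and MSE of the plug-in estimator for two species by enumeration."""
    probs = {1: p1, 2: 1.0 - p1}
    theta = sum(pi * (1.0 - pi) ** n for pi in probs.values())
    total = n + m
    mean = 0.0
    mse = 0.0
    for draws in product(probs, repeat=total):
        weight = 1.0
        for d in draws:
            weight *= probs[d]          # probability of this ordered sample
        est = sum((c / total) * (1.0 - c / total) ** n
                  for c in Counter(draws).values())
        mean += weight * est
        mse += weight * (est - theta) ** 2
    return theta, mean, mse


theta2, mean_hat, mse_hat = exact_npmle_moments(0.6)
```

For $p_1 = 0.6$ the enumeration gives $\theta_2 = p_1 p_2 = 0.24$ and an expected value of two thirds of that, which matches the bias stated in the text.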
Table 4.2 (continued)

  p     c     n    m    θ_n     E(V_m)   RMSE(V_m)   E(θ̂_m)   RMSE(θ̂_m)   E(θ*_m)   RMSE(θ*_m)

 .1   100    10    0   .0443    .0494     .0566      .0214      .0293      .0428      .0367
             10    1   .0443    .0388     .0498      .0208      .0297      .0396      .0348
             10   10   .0443    .0438     .0270      .0310      .0213      .0465      .0251
             50    0   .0075    .0078     .0108      .0047      .0051      .0093      .0087
             50    1   .0075    .0075     .0106      .0045      .0051      .0089      .0083
             50   10   .0075    .0077     .0093      .0049      .0051      .0090      .0081
             50   50   .0075    .0074     .0057      .0057      .0042      .0086      .0058

 .5   100    10    0   .1312    .1471     .1057      .0719      .0691      .1438      .0720
             10    1   .1312    .1305     .0951      .0759      .0656      .1448      .0686
             10   10   .1312    .1318     .0543      .0982      .0476      .1473      .0539
             50    0   .0283    .0290     .0212      .0163      .0141      .0327      .0157
             50    1   .0283    .0277     .0205      .0163      .0142      .0322      .0152
             50   10   .0283    .0284     .0163      .0180      .0126      .0329      .0140
             50   50   .0283    .0291     .0108      .0222      .0093      .0333      .0116

 .9   100    10    0   .6095    .0269     .1890      .2505      .3628      .5011      .1511
             10    1   .6095    .6150     .1792      .2771      .3368      .5220      .1319
             10   10   .6095    .6084     .1051      .4011      .2161      .6017      .0861
             50    0   .1855    .1856     .0533      .1030      .0846      .2060      .0428
             50    1   .1855    .1856     .0509      .1059      .0016      .2098      .0435
             50   10   .1855    .1857     .0447      .1155      .0725      .2117      .0437
             50   50   .1855    .1857     .0269      .1410      .0477      .2115      .0367


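While $\hat{\theta}_m$ may be an attractive estimator in the truncated geometric case
in terms of its mean square error, it has already been noted that its bias can
be considerable. In fact, expanding to second order,

$$E(\hat{\theta}_m) = \theta_n - (n+m)^{-1} \Big\{ n\,\theta_{n-1} - \binom{n+1}{2} \sum_i p_i^2 q_i^{n-1} \Big\} + O((n+m)^{-2}).$$

This suggests that the quantity $\hat{\theta}_m + (n+m)^{-1} \{ n \sum_i \hat{p}_i \hat{q}_i^{\,n-1} - \binom{n+1}{2} \sum_i \hat{p}_i^2 \hat{q}_i^{\,n-1} \}$ would be
a better estimator of $\theta_n$ than $\hat{\theta}_m$ alone. For the size of the samples discussed
here, this corrected quantity still tends to underestimate $\theta_n$ too severely and a better
estimator can be obtained by replacing the term in braces by $n\hat{\theta}_m$, leading to the estimator

$$\theta_m^* = \hat{\theta}_m (1 + n/(n+m)). \qquad (4.1)$$

Values of $E(\theta_m^*)$ and RMSE($\theta_m^*$) are given in Tables 4.1-4.2. Generally, $\theta_m^*$ has
good bias properties and compares favorably with $V_m$ in terms of RMSE, even for
the equiprobable case.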
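The adjusted estimator is a simple rescaling of the NPMLE, as the following sketch shows. The function name `theta_star` is hypothetical; the numerical check uses the tabulated mean of the NPMLE for $p = .1$, $c = 100$, $n = 10$, $m = 0$ from Table 4.2, where the scaling factor is exactly 2.

```python
def theta_star(theta_hat, n, m):
    """Bias-adjusted estimate theta_hat * (1 + n / (n + m))."""
    return theta_hat * (1.0 + n / (n + m))


# With m = 0 the NPMLE is doubled: .0214 becomes .0428, matching the table entries.
adjusted = theta_star(0.0214, 10, 0)
```

The inflation factor $1 + n/(n+m)$ lies between 1 and 2 and shrinks toward 1 as the number of additional draws grows, so the adjustment vanishes in the large-$m$ limit.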

It should be noted that $\hat{\theta}_m$ and $V_m$ are, in some sense, "retrodictors."
That is, they predict, on the basis of $n+m$ observations, what would be
observed for the last $m$ observations. In Starr (1979), an argument is given
that this is not a vacuous exercise; $V_m$ can be used effectively to predict, on
the basis of an initial sample size $n$ and a subsequent sample of size $m$, what
will occur in a large future sample of size $M$. This argument applies equally
well to the NPMLE $\hat{\theta}_m$. However, it can be argued that the principal interest
of estimators such as $\hat{\theta}_m$ and $V_m$ is in their properties as true predictors. For
example, Rasmussen and Starr (1979) used the estimator $V_0 = n^{-1} \sum_i I(x_i^n = 1)$
to consider a rule for sequentially sampling a population. Similarly, the
estimators $\hat{\theta}_0$ and $\theta_0^*$ could also be used in such a capacity. We leave the
examination of such sequential rules to a future paper and consider here only
the properties of $V_0$, $\hat{\theta}_0$ and $\theta_0^*$ as estimates of $\theta_n$. Simulation results appear
in Tables 4.1-4.2. In terms of mean square error, again we see that, in the
equiprobable case, $V_0$ dominates $\hat{\theta}_0$ and that $\theta_0^*$ compares favorably with $V_0$. In
the truncated geometric case, both $\hat{\theta}_0$ and $\theta_0^*$ dominate $V_0$ except when $p$ is near
one in which case $V_0$ tends to be a better estimator than $\hat{\theta}_0$.

w  O 


§5. Summary and Discussion

This paper has focused on nonparametric estimators of $\theta_n$, the probability
of discovering a new species. We have shown $V_m$ to be a minimum variance
unbiased estimator with a high rate of convergence in the equiprobable case.
The nonparametric maximum likelihood estimator, $\hat{\theta}_m$, has similar asymptotic
properties. In small samples, $V_m$ is a better estimator than $\hat{\theta}_m$ in the
equiprobable cell case with respect to mean square error; this is reversed for
truncated geometric distributions when $p$ is not large. An estimator with
somewhat less bias than $\hat{\theta}_m$ is $\theta_m^*$, defined in (4.1); it compares favorably with
$V_m$ in terms of mean square error.

Besides the theoretical interest in $\hat{\theta}_m$ as an estimator which competes
well with $V_m$ in the truncated geometric case, we argue that this has practical
implications. For example, data collected by Andrews (1985) of the species
abundance of epiphytic fungi on apple leaves fit a truncated geometric
distribution quite well with $p = .77$. Arguments are given by Pielou (1977) that a
geometric distribution, or more generally, a negative binomial distribution is
appropriate in some situations for modeling species distributions. It remains
to be seen how $V_m$ and $\hat{\theta}_m$ compare over a wider class of distributions.

m 


Appendix A. Proof of §2 Results

Lemma A.1

The parameter $\theta_n$ is homogeneous over $\Pi$ and is of degree $n+1$.

Proof:

Sufficient for the proof is to show that $q_i^k$ is homogeneous of degree
$k$. Since $q_i^k = E_P(\prod_{j=1}^{k} I(X_j \ne i))$ for all $P \in \Pi$, we have that $q_i^k$ is
homogeneous of degree $\le k$. We now suppose that $q_i^k$ is homogeneous of degree
$h$ and show that $h \ge k$. Thus, assume there exists $\psi(x_1, \ldots, x_h)$ so that

$$q_i^k = E_P(\psi(X_1, \ldots, X_h)) \qquad (A.1)$$

for all $P \in \Pi$. Suppose $\Pi_1$ is a subset of $\Pi$ so that

$$P_q(1) = q \quad \text{and} \quad P_q(2) = 1 - q$$

and $\Pi_1 = \{P_q \in \Pi,\ 0 < q < 1\}$. With the choice of $P_q$, the left-hand side of
(A.1) is a polynomial in $q$ of degree $k$ while the right-hand side is a
polynomial in $q$ of degree, say, $h_1 \le h$. Since these polynomials must be of the
same degree, we have $k = h_1 \le h$. □

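Define $h_{1n}(X_1) = E(h(X_1, \ldots, X_{n+1}) \mid X_1) - \theta_n$. The proof of Property 2.3 is
complete with $\sigma^2 = (n+1)^2\, \mathrm{Var}(h_{1n}(X_1))$ and the following

Lemma A.2

$$\mathrm{Var}(h_{1n}(X_1)) = \sum_i (p_i - (n+1)^{-1})^2 p_i q_i^{2n-2} - (\theta_n - (1 + n^{-1})^{-1} \theta_{n-1})^2.$$

Proof:

Use (2.2) to get

$$E(h(X_1, \ldots, X_{n+1}) \mid X_1) = (n+1)^{-1} \sum_i q_i^{n-1} \{ n p_i I(X_1 \ne i) + q_i I(X_1 = i) \}.$$

Thus, by rearranging terms,

$$h_{1n}(X_1) = \sum_i q_i^{n-1} (p_i - (n+1)^{-1}) (p_i - I(X_1 = i)).$$

Hence,

$$E\, h_{1n}^2(X_1) = \sum_i p_i q_i^{2n-2} (p_i - (n+1)^{-1})^2 - \sum_i \sum_j p_i p_j q_i^{n-1} q_j^{n-1} (p_i - (n+1)^{-1})(p_j - (n+1)^{-1}),$$

which gives the result upon a rearrangement of terms. □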
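The variance formula in Lemma A.2 can be checked numerically for a small species distribution. The sketch below computes the conditional expectation of the kernel given the first observation directly from its definition and compares the resulting variance with the closed form. It assumes that the second term of the lemma reads $(\theta_n - (1+n^{-1})^{-1}\theta_{n-1})^2$; the function name `check_lemma_a2` is hypothetical.

```python
def check_lemma_a2(p, n):
    """Return (direct variance of h_1n(X_1), closed-form variance) for distribution p."""
    theta_n = sum(pi * (1.0 - pi) ** n for pi in p)
    theta_nm1 = sum(pi * (1.0 - pi) ** (n - 1) for pi in p)

    def h1(x):
        # E(h | X_1 = x) - theta_n, using
        # E(h | X_1) = (n+1)^{-1} sum_i q_i^{n-1} {n p_i I(X_1 != i) + q_i I(X_1 = i)}
        total = 0.0
        for i, pi in enumerate(p):
            qi = 1.0 - pi
            total += qi ** (n - 1) * (qi if i == x else n * pi)
        return total / (n + 1) - theta_n

    mean = sum(px * h1(x) for x, px in enumerate(p))
    var_direct = sum(px * h1(x) ** 2 for x, px in enumerate(p)) - mean ** 2
    var_formula = (
        sum((pi - 1.0 / (n + 1)) ** 2 * pi * (1.0 - pi) ** (2 * n - 2) for pi in p)
        - (theta_n - theta_nm1 / (1.0 + 1.0 / n)) ** 2
    )
    return var_direct, var_formula


direct, formula = check_lemma_a2([0.5, 0.3, 0.2], 4)
```

Agreement of the two numbers for several distributions is a useful sanity check on the rearrangement of terms in the proof, although it is of course not a substitute for it.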

To prove Property 2.4, we need to examine the properties of the following
projection of $h$,

$$h_{2n}(X_1, X_2) = E(h(X_1, \ldots, X_{n+1}) \mid X_1, X_2) - h_{1n}(X_1) - h_{1n}(X_2) - \theta_n$$
$$= (1-\mu)^{n-2} (\mu - 2(n+1)^{-1}) (I(X_1 = X_2) - \mu). \qquad (A.2)$$

To see (A.2), first note that it is easy to check that $\theta_n = (1-\mu)^n$ and
that $h_{1n}(X_1) = 0$. Use (2.2) to get

$$E(h(X_1, \ldots, X_{n+1}) \mid X_1, X_2) = (n+1)^{-1} (1-\mu)^{n-2} \sum_i \{ (n-1)\mu\, I(X_1 \ne i) I(X_2 \ne i)$$
$$\qquad + (1-\mu)( I(X_1 = i) I(X_2 \ne i) + I(X_1 \ne i) I(X_2 = i) ) \}$$
$$= (1-\mu)^{n-2} \{ 1 - 2\mu n/(n+1) + (\mu - 2(n+1)^{-1}) I(X_1 = X_2) \}$$

after some algebra. Subtracting $\theta_n$ yields (A.2). The proof of Property 2.4
is now an application of a result independently due to Gregory (1977) and
Serfling (see Serfling, 1980, page 192).


Proof of Property 2.4:

Let $K = (1-\mu)^{n-2} (\mu - 2(n+1)^{-1})$ so that $h_{2n}(X_1, X_2) = K(I(X_1 = X_2) - \mu)$. It is
immediate that $\mathrm{Var}(h_{2n}(X_1, X_2)) = K^2 \mu(1-\mu) > 0$. Now, let $g$ be an arbitrary,
measurable function such that $E(g^2(X)) < \infty$ and let $x$, $\lambda$ be real constants. The
forms of $g(\cdot)$ and $\lambda$ satisfying

$$\lambda g(x) = E\{ h_{2n}(x, X) g(X) \} = K\mu \{ \sum_i I(x = i) g(i) - E g(X) \} = K\mu \{ g(x) - E g(X) \}$$

are of two types. If $E g(X) \ne 0$, then

$$g(x) = E g(X) / (1 - \lambda/(K\mu))$$

is a constant ($\ne 0$) and thus $\lambda = 0$. If $E g(X) = \mu \sum_i g(i) = 0$, then $\lambda = K\mu$.
Thus, for example, by Serfling (1980, page 194), we have the result. □


Appendix  B.  Proof  of  §3  Results 


Proof of Property 3.1:

Define $G(x) = x(1-x)^n$ and note that $\theta_n = \sum_i G(p_i)$ and that $\hat{\theta}_m = \sum_i G(\hat{p}_i)$.
By a Taylor-series expansion,

$$\hat{\theta}_m = \theta_n + \sum_i (\hat{p}_i - p_i) G'(p_i) + O( \sum_i (\hat{p}_i - p_i)^2 ),$$

since $G''(x)$ is bounded for $0 \le x \le 1$. Now, since

$$(n+m)^{1/2} E \sum_i (\hat{p}_i - p_i)^2 = (n+m)^{-1/2} \sum_i p_i q_i \le (n+m)^{-1/2} \to 0, \qquad (B.1)$$

we have that $(n+m)^{1/2} \sum_i (\hat{p}_i - p_i)^2 \to 0$ in probability. By Fubini's Theorem, we
have that

$$\sum_i (\hat{p}_i - p_i) G'(p_i) = (n+m)^{-1} \sum_{j=1}^{n+m} \{ \sum_i G'(p_i)( I(X_j = i) - p_i ) \}.$$

This, the central limit theorem and Slutsky's theorem give the result. □


Proof of Property 3.2:

By a Taylor-series expansion,

$$\hat{\theta}_m = \theta_n + G''(\mu)/2 \sum_i (\hat{p}_i - \mu)^2 + O( \sum_i |\hat{p}_i - \mu|^3 ),$$

since $G'''(x)$ is bounded for $0 \le x \le 1$ and $\sum_i (\hat{p}_i - \mu) = 0$. Similarly to (B.1), we
have that $(n+m) \sum_i |\hat{p}_i - \mu|^3 \to 0$ in probability. Thus

$$(n+m)(\hat{\theta}_m - \theta_n) = (n+m)\, G''(\mu)/2 \sum_i (\hat{p}_i - \mu)^2 + o_p(1). \qquad (B.2)$$

Now,

$$\sum_i (\hat{p}_i - \mu)^2 = \sum_i ( (n+m)^{-1} \sum_{j=1}^{n+m} I(X_j = i) )^2 - \mu$$
$$= (n+m)^{-1} + 2(n+m)^{-2} \sum_{j<k} I(X_j = X_k) - \mu$$
$$= (n+m)^{-1} (1-\mu) + (1 - (n+m)^{-1})\, U, \qquad (B.3)$$

where $U = \binom{n+m}{2}^{-1} \sum_{j<k} I(X_j = X_k) - \mu$ is a U-statistic. As in the proof of
Property 2.4, $E(U \mid X_1) = 0$ and

$$\binom{n+m}{2} E(U \mid X_1, X_2) = (I(X_1 = X_2) - \mu).$$

Thus, by the same argument as in the proof of Property 2.4 (with $K = 1$), we have

$$(n+m)\, U \to_D \mu(\chi_1^2 - 1).$$

This, (B.2), (B.3) and Slutsky's Theorem yield the result. □

Proof of Property 3.3

We need only show that $\sigma^2 = 0$ implies $p_i = p_j$ for each $i, j$. To do this
we construct the random variable

$$X = (n+1)(n p_i - q_i) q_i^{n-1} \quad \text{with probability } p_i,\ i = 1, 2, \ldots$$

Now, it is easy to see that $\mathrm{Var}(X) = (n+1)^2 \sigma^2$ and thus, $\sigma^2 = 0$ means that
$(n+1)(n p_i - q_i) q_i^{n-1} = (n+1)^2 (p_i - (n+1)^{-1}) q_i^{n-1}$ must be constant in $i$; write this
constant as $(n+1)^2 C$. Since the number of species exceeds $n$, we have $p_i < (n+1)^{-1}$ for
some $i$ and $C$ must be nonpositive. The question of whether different $p_i$ may
satisfy $(p_i - (n+1)^{-1}) q_i^{n-1} = C$ is equivalent to finding the number of
roots of

$$h(x) = (n/(n+1) - x) x^{n-1} - C, \qquad 0 \le x \le 1.$$

Now, $h'(x) = n x^{n-2} ((n-1)/(n+1) - x)$ is positive for $0 < x < (n-1)/(n+1)$ and is
negative for $(n-1)/(n+1) < x < 1$. Further, $h(0) = -C$ and $h(1) = -(n+1)^{-1} - C$.
Thus, for $-(n+1)^{-1} < C < 0$ there is exactly one root and no roots for
$C < -(n+1)^{-1}$. □